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Centralized and Distributed Newton Methods for 
Network Optimization and Extensions 

Dimitri P. Bertsekasf 
Abstract 

We consider Newton methods for common types of single commodity and multi-commodity network 
flow problems. Despite the potentially very large dimension of the problem, they can be implemented using 
the conjugate gradient method and low-dimensional network operations, as shown nearly thirty years ago. 
We revisit these methods, compare them to more recent proposals, and describe how they can be implemented 
in a distributed computing system. We also discuss generalizations, including the treatment of arc gains, 
linear side constraints, and related special structures. 


1. INTRODUCTION 

A common type of nonlinear network flow optimization involves a graph with a set of directed arcs A, and a 
set of paths V. Each path p consists of a set of arcs Ap c A. A set of path flows x = {xp | p S P} produces 
a flow at each arc a € A, 

fa = 'y ^ Xp. 

{p\aeAp} 

The objective is to find x that minimizes 

— X! Rpi^p) + X! Da{fa), (1-1) 

pev aeA 

where Da and Rp are twice continuously differentiable functions. Often either Rp = 0 or else Rp encodes 
a penalty for a range of values of Xp (such as small nonnegative values), while Da encodes a penalty for 
a range of values of fa (such as values close to a “capacity” of arc a). The minimization may be subject 
to some constraints on x, such as upper and lower bounds on each Xp, supply/demand constraints (sum of 
path flows with the same origin and destination must take a given value), side constraints etc. Sometimes 
capacity constraints on the arc flows are imposed explicitly rather than through the functions Da- This is 
a standard framework, discussed for example in the author’s works ([BeT89], Section 7.6, [BeG92], Section 
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5.7.3, [Ber98], pp. 391-398), and in many other works, including the survey [F1H95], which contains extensive 
references, and textbooks such as [Roc84] and [Pat98]. 

There are several methods for solving this problem. Among first-order gradient-based methods, we note 
the conditional gradient/Frank-Wolfe method [FGK73], [F1H95], gradient projection methods with and with¬ 
out diagonal scaling [Gal77], [BerSO], [BGG84], [BeG92], and simplicial decomposition/inner approximation 
methods [GaG74], [Hol74], [FIoh77], [FILV87], [Pat98], [BeYlO]. Gradient projection methods are well-suited 
for distributed implementation, and several such methods have been proposed for routing and flow control, 
dating to the early days of data networking (see [Ber80], [GaG80], [BGG84], [TsB86]). Similar methods have 
also been proposed for traffic equilibrium analysis, where the same types of optimization problems arise (see 
the survey [F1H95] and the book [Pat98]). 

In this note, we focus on second order/Newton-type methods, which have also been proposed since the 
early days of data networking, but have received less attention, possibly because of the emphasis on simple 
methods in large-scale optimal network flow problems. A common concern with Newton-type methods is that 
they require the calculation of the gradient and Hessian of F, and the solution of some large-scale quadratic 
optimization problem, either unconstrained or constrained. However, what has been sometimes overlooked 
is that one may exploit special problem structure to calculate the Newton direction without forming or 
inverting the large-dimensional Hessian matrix of F. This idea is quite common in the numerical analysis 
and interior point method literature, where the Newton direction is often calculated using essentially iterative 
methods, such as conjugate direction, splitting, and Krylov subspace methods. We will discuss primarily 
the unconstrained case and refer to the literature for various ways to handle constraints, either directly or 
through the use of penalty and augmented Lagrangian functions. Diagonal second-order preconditioning 
may also be used to make the method more effective, and to facilitate the choice of stepsize. 

Our point of departure is a fact known since the papers by Bertsekas and Gafni [BeG83], [GaB84], 
which proposed Newton-type methods for various types of constrained optimization problems, including 
multicommodity flow problems that involve nonnegativity constraints on x and supply/demand constraints. 
Reference [BeG83] showed that for any x, the pure Newton direction y^r, which satisfies 

V^F{x)yN = -VF{x), (1.2) 

can be conveniently calculated by using the conjugate gradient method, and graph operations that do not 
require the explicit formation and storage of V^F{x) or its inverse. Reference [GaB84] embedded this idea 
within a broader class of two-metric gradient projection algorithms, and provided computational results. 
In practice, the Newton direction should be approximated by only a limited number of conjugate gradient 
iterations, thereby obtaining an approximate Newton direction, which however is still a descent direction, 
regardless of how many conjugate gradient iterations are performed. Diagonal second-order preconditioning 
may be used to make the method more effective, and to facilitate the choice of stepsize. 
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The computational complexity per iteration of the conjugate gradient method is 0{T), where T is the 
total number of arcs traversed by the paths in V. The complexity of computing the Newton direction is thus 
0{PT), where P is the number of paths in V (assuming the worst case number P for the number of conjugate 
gradient iterations to compute the Newton direction). If the “average” number of arcs T/P traversed by 
a path is much smaller than P, the conjugate gradient-based 0{PT) computation is much faster than the 
0{P^) computation required to obtain the Newton direction by direct matrix inversion. If only m conjugate 
gradient iterations are performed to compute approximately the Newton direction, the total required number 
of computational operations is 0{mT). These estimates should be compared with the computation required 
to compute the gradient of the cost function and to perform a diagonally scaled gradient iteration, which is 
0{T). 

In this note, we will also describe how the graph operations for conjugate-gradient based Newton 
direction calculation can be done in a distributed computing environment, where there is a processor assigned 
to each path and a processor assigned to each arc. Then the Newton direction can be calculated simply, with 
each processor updating, maintaining, and communicating to other processors just a few numbers. Moreover 
the processor assigned to path p computes the corresponding component pp of the Newton direction, and may 
execute (in synchrony with the other processors) its local portion of a pure Newton iteration. The processors 
may also collaborate to compute a stepsize guaranteeing descent at every iteration. Of course, it is possible 
to assign multiple paths and arcs to a single processor, with an attendant increase of the computation and 
communication load of the processor. Information exchange between the processors may be conveniently 
and accurately performed by using a spanning tree with a designated node serving as a synchronizer and 
leader for the distributed computation (see [BeT89], [BeG92]). 

After describing the centralized and distributed methods for calculating the Newton step, we explore 
various generalizations, involving arc gains, side constraints, etc. Part of this discussion has not explicitly 
appeared in the literature so far, and we record it here for the purpose of making a connection with recent 
works on Newton-like methods for distributed routing and flow control in data networks [AtLOO], [JOZ09], 
[WOJIO]. At the same time, we should note that most of what we describe, including the central idea, was 
essentially known in 1983. 


2. CENTRALIZED COMPUTATION OF THE NEWTON STEP 

We focus on a fixed path flow vector x and to simplify notation, we suppress the arguments x, Xp, and 
fa from various function expressions. We show how to calculate the Newton step of Eq. (1.2) using the 
conjugate gradient method and graph-based operations. We write 

g = VF{x), H = V^F{x), 
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and for each path p, we denote by gp and Hpp the corresponding component of g and diagonal component 
ofH. 

We have 

gp = R'p + Y, D'a, p G P, (2.1) 

d^Ap 

where Rp and D'a denote the first derivatives of Rp and Da, calculated at the current arguments Xp and fa, 
respectively. Moreover 

H = '^^R + E'V'^DE, (2.2) 

where is the diagonal matrix with the second derivatives Rp along the diagonal, V^D is the diagonal 
matrix with the second derivatives Da along the diagonal, and E is the matrix with components 


Eap — 


1 if a G Ap, 
0 if a ^ Ap. 


Of particular interest are the diagonal elements of iJ, which are 


Hpp — Rp + 'y ( Da, p G P. (2-3) 

d^Ap 

Note that the computation of gp and Hpp, collectively for all p, requires a total of 0{T) computational 
operations. 

By exploiting the special structure of g and H, we can perform some computations of interest using 
graph operations. These are: 

(a) For each path p, we can calculate gp by accumulating first derivatives along p [cf. Eq. (2.1)]. Moreover, 
we can similarly calculate the diagonal elements Hpp by accumulating second derivatives along p [cf. 
Eq. (2.3)]. The number of required computational operations is 0{T). 

(b) For any vector v = {vp \ p G V}, we can calculate the matrix-vector product 


w = Hv 


[cf. Eq. (2.2)] as the gradient of the function 


^v'Hv — ^ RpVp + 5 'Y^ Dafa,v 
peP a^A 


where 


fa,v — '^P 

{p\aeAp} 


G A. 


(2.4) 
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Thus, once we calculate fa,v for all a G ^ by accumulating Vp along path p [an 0{T) computation], we 
can obtain the pth component Wp of w = Hv a.s 


Wp — RpVp T ^ ] Ra f a,v ^ P ^ ^ (^■^) 

O-G-Ap 

using a similar computation to the gradient (2.1) and the diagonal Hessian components (2.3). Again 
the total number of required computational operations is 0{T). 


We now turn to the calculation of the Newton direction by viewing it as the solution of the quadratic 
problem 

min C{y) g'y + ^y'Hy. (2.6) 

y 2 

The idea is to solve this problem with the conjugate gradient method, which requires just inner products and 
matrix-vector products of the form Hv that can be done using graph operations, without explicit formation 
of H, as we have seen. We assume that H is positive definite so that the Newton direction is defined 
and the subsequent conjugate gradient algorithm has guaranteed termination, but this assumption is not 
essential for constrained versions of Newton-type methods, and for problems of practical interest, related 
conjugate-gradient based algorithms can be obtained (see e.g., [BeG83], [GaB84]).f 


In particular, we start the iterative process with j/o = 0 as the initial iterate for the minimum of the 
quadratic cost C, with the gradient of C at j/ = 0, which is tq = g, and with po = —g, which we view 
as the first conjugate direction. Given the current iterate-gradient-conjugate direction triplet {yk,Tk,Pk), 
we generate the next iterate-gradient-conjugate direction triplet {yk+ijrk+i,Pk+i) by a conjugate gradient 
iteration, which has the following form (see e.g., [Lue84], [Ber99], Section 1.6): 


Vk+l = Vk + O-kPk, 


where ak 


r'k^k 
p'kHpk ’ 


rfc+i = g + Hyk+i, 

’k'k+i'kk+l 

Pk +1 = -rk +1 + PkPk, where Pk = --• 


Here ak is the stepsize that minimizes C{y) over the line {yk+ctpk \ a € 5R}; it can also be written equivalently 
(but less conveniently) as ak = —p'f.rk/p'f.Hpk (since pk = —rk + f3kPk-i and is orthogonal to Pk-i, ■ ■ ■ ,Po, 


I If H is positive semidefinite but singular, either the problem (2.6) has an infinite number of optimal solutions, 
in which case the conjugate gradient method will find one in a finite number of iterations (at most P minus the 
dimension of the optimal solution set), or else problem (2.6) has no solution, in which case for some k, the stepsize au 
in the subsequent conjugate gradient method will become “infinite”along a conjugate direction pk where p'^Upk — 0. 
Then by moving along the line yk + apk, we will be continuously decreasing the cost C(y), and any vector of the form 
Pk -b apk with a > 0 will be a direction of descent of Fix) at x, which may be used in place of the Newton direction. 
Conjugate gradient methods are often implemented with such safeguards against singularity of H, particularly in 
constrained contexts. 
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a basic property of the conjugate gradient method). The matrix-vector products to be computed at each 
iteration are Hpk and Hyk+i, and they can be obtained by using Eqs. (2.4)-(2.5), with v = pk and v = yk+i, 
respectively. 

According to well-known theory [Lue84], [Ber99], either yk+i minimizes C{y) (i.e., rk+i = 0) and hence 
yk+i is equal to the Newton direction. Moreover (unless 5 = 0) we have 

Ciyk) < C{yo) = 0, V A: > 0, 


so that 


g'Vk < -y'kHyk <0, V fc > 0. 


(2.7) 


We may either let the process terminate naturally (which will happen after a number of iterations no larger 
than the number of paths P), or more practically, terminate once a certain termination criterion is satisfied, 
in which case yk is a descent direction by Eq. (2.7). 

An important variant, which is usually far superior in practice, is to use a preconditioning matrix S, 
which is a diagonal approximation to the Hessian matrix, i.e, S is diagonal with Hpp along the diagonal. 
This method starts with j/o = 0, ro = g, and po = —Sg, and has the form 


Vk+l Vk 4 “ ^kPki 

Tfc+i = g + Hyk+i, 


where Uk 


r'f^Srk 
p'kHpk ’ 


Pk+l — Svk+l ~\~ f^kPki 


where Pk 


r'k+iSrk+i 

r'kSrk 


The non-preconditioned method bears the same relation to the preconditioned version that the gradient 
method bears to the diagonally scaled gradient method, with second-order diagonal scaling. The use of 
this type of preconditioning not only improves the rate of convergence (typically), but also facilitates the 
choice of stepsize (a stepsize of 1 typically works, regardless of the number of conjugate gradient iterations 
used). These advantages of preconditioning have been confirmed by extensive computational experience, 
although theoretically speaking there are rare exceptions where diagonal second-order preconditioning does 
not improve performance. 

A somewhat different type of diagonal preconditioning can be used with advantage in the case where 
the number of arcs, call it A, is substantially smaller than the number of paths P. Then by using S = V'^R{x) 
as a preconditioning matrix, it can be shown that the number of conjugate gradient iterations to find the 
Newton direction is at most A, as noted in [Ber74] (see also [Ber99], p. 148). 

In the preceding methods, each conjugate gradient iteration involves a small number of inner products 
and matrix-vector multiplications, each of which requires no more than 0{T) computation. Thus the to¬ 
tal required number of computational operations to compute approximately the Newton direction with m 
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conjugate gradient iterations is 0{mT), compared to 0{T) for a single gradient calculation and diagonally 
scaled gradient iteration. 

The choice of number of conjugate gradient steps to obtain an approximate Newton direction is an 
interesting practical implementation issue, and may ultimately be settled by experimentation. The paper 
[DES82] shows how to control this number so that a linear, superlinear, or quadratic convergence rate for 
the overall method is achieved. Generally, to attain a superlinear rate, the conjugate gradient process for 
approximating the Newton step must become asymptotically exact, with the number of conjugate gradient 
steps per iteration approaching P. This appears highly inefficient for practical large network flow prob¬ 
lems, where P is large and just a few conjugate gradient steps per iteration are sufficient to attain a good 
convergence rate, particularly when second order diagonal preconditioning is used. 


3. DISTRIBUTED COMPUTATION OF THE NEWTON STEP 

The algorithm of the preceding section can also be executed in a distributed fashion, assuming that there is 
a processor assigned to each path and a processor assigned to each arc. We may also assign multiple paths 
and arcs to a single processor, with an attendant increase of the computation and communication load of 
the processor, but for simplicity we do not consider this possibility. 

To compute the Newton step, for each p, the processor assigned to path p may update, maintain, and 
communicate to other processors its own path variable Xp, and compute the corresponding component yp of 
the Newton direction. For each a, the processor assigned to arc a may compute and communicate to the 
relevant path processors, its accumulated flow variable fa- Similar computations are used to execute the 
intermediate conjugate gradient iterations. The path processors may also collaborate to compute a stepsize 
guaranteeing descent at every iteration, although in many network applications a constant stepsize can be 
used reliably, as experience has shown [BGG84], [GaB84] (equal or nearly equal to one in our case). 

Information exchange between the processors may be conveniently and accurately performed by using 
a spanning tree with a designated node serving as a synchronizer and leader for the distributed computa¬ 
tion. Schemes of this type have been used extensively in data networks and distributed computing systems. 
Recently, iterative consensus schemes [TBA86], [BeT89] have been discussed as possible methods for in¬ 
formation exchange between processors, but these schemes are slow relative to Newton’s method, so they 
are not suitable for our context. On the other hand, it should be noted that in certain application con¬ 
texts, the need for synchronization to perform Newton and conjugate gradient iterations may be a major 
drawback over asynchronous diagonally scaled gradient and gradient projection methods, which can be 
implemented asynchronously with satisfactory convergence properties (see [Ber83], [BeT89] for totally asyn¬ 
chronous gradient-like methods, and [TsB86], [BeT89] for partially asynchronous and stochastic gradient-like 
methods). 
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4. ALGORITHMS, CONVERGENCE, AND RATE OF CONVERGENCE 


There are several network flow problem formulations and corresponding algorithms, where the Newton 
direction computations of the preceding two sections may be potentially used. An important issue is the 
treatment of constraints in this context. Here are some potential approaches, leading to viable Newton-type 
algorithms: 

(a) Constraints may be eliminated via a penalty or augmented Lagrangian approach to yield an uncon¬ 
strained problem of minimizing a function F(x) of the form (1.1). The convergence analysis of such 
approaches requires (1) a convergence guarantee for the unconstrained optimization of the penalized or 
augmented Lagrangian objective, and (2) a convergence guarantee for the overall penalty or augmented 
Lagrangian objective. These guarantees may be obtained in straightforward fashion as applications of 
standard results for gradient-related and Newton-like methods for unconstrained minimization, as well 
as standard analyses of penalty and augmented Lagrangian methods (see e.g., [Ber82a], [Ber99]). 

(b) Nonnegativity constraints can be treated with two-metric Newton-like methods that require computa¬ 
tion of Newton directions such as the ones of the preceding two sections. These are methods involving 
partial diagonalization of the Hessian matrix, for which a detailed convergence and rate of convergence 
analysis is given in [Ber82b], [BeG83], [GaB84]. 

(c) Simplex constraints, relating to supply/demand specifications, can be used to eliminate some of the 
variables and essentially reduce the constraint set to the nonnegativity case of (b) above (see again 
[Ber82b], [BeG83], [GaB84] for convergence and rate of convergence analysis). 

Regardless of how constraints are treated, for a convergent algorithm one must address the issue of 
stepsize selection. Experience has shown that in network algorithms one may often use reliably a constant 
stepsize, particularly when the directions used embody second order information, which makes a stepsize 
close to one a typically good choice. An alternative to a constant stepsize is a line search rule based on 
line minimization or successive stepsize reduction. References [Ber82b], [BeG83], [GaB84] provide examples 
of successive stepsize reduction rules in conjunction with constraints. The use of such rules improves the 
reliability of algorithms but introduces additional complexity, particularly in a distributed context. In the 
latter case, it is possible to implement successive stepsize reduction rules at a special processor that may 
exchange information with other processors in a distributed way. 

We do not discuss convergence issues further. The aim of this paper is not to provide specific algorithms 
and associated convergence analysis (which is routine for the most part, as well as problem dependent), but 
rather to make the point that the network structure can be used to implement the computation of Newton 
directions in a convenient centralized or distributed manner. 
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5. EXTENSIONS 


There are several problem generalizations, involving in some cases an extended network or even a non-network 
structure, which admit a treatment similar to the one of the preceding sections: 

(a) Cost functions F that are defined over just a subset of the path flow space arise when constraints are 
eliminated by means of an interior point method approach. The use of Newton directions within this 
context is well-documented. 

(b) More general linear dependence of arc flows on path flows can be treated by generalization of the terms 
Da{fa) in the cost function. In particular, we may redefine fa to be a general linear function c'aX where 
Co is some vector. For example the scalar components of Co may represent arc gains. In this case, the 
entire approach of the preceding two sections generalizes straightforwardly. What is essential is that 
the Hessian of F should have the generic form 

V^R + E'SJ^DE 

of Eq. (2.2), with and V^D being diagonal matrices, and E being a matrix that encodes a linear 
dependence between x and the arguments fa of the cost terms Da{fa)- 

(c) Linear side constraints may be treated by using a penalty or augmented Lagrangian approach, thereby 
reducing to case (b) above. 

(d) The basic Hessian structure that is important for the convenient computation of gradients and Hessian 
matrix-vector products is 

E'DE, 

where D and E are matrices with D diagonal. Therefore our methodology will also work for structures 
of the form 

R + E[DiEi -|- • • • -f E'mDmEra, 

where R, and Ei,...,Em are matrices with R and Di,...,Dm diagonal. There are 

also potential extensions in cases where R and Di ,..., Dm are symmetric and nearly diagonal (e.g., 
tridiagonal). 

An important question is how to deal with singularity of the Hessian matrix and the attendant lack of 
strong convexity. This arises for example in the important case where V^i? = 0, and there are constraints 
on X (nonnegativity and/or supply/demand constraints), which guarantee existence of a solution. While in 
this case subsequence convergence of our methods to minima is easy to show under standard assumptions, 
the convergence (to a single point) and the establishment of a linear or superlinear convergence rate result 
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are open questions. By contrast, the issue of convergence to a single point and linear convergence rate in 
the presence of Hessian singularity has been satisfactorily addressed for gradient projection methods in the 
context of variational inequalities, including multicommodity flow problems [BeG82]. 
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